Confidence Estimation for Automatic Speech Recognition Hypotheses

Author

  • Matthew Stephen Seigel
Abstract

Automatic speech recognition (ASR) systems produce transcriptions for audio which sometimes contain errors. It is useful to know how much confidence may be placed in this output being correct. Confidence estimation is concerned with obtaining scores which quantify this level of confidence. The development and application of a principled, flexible framework using conditional random field (CRF) models for confidence estimation is described. Errors tend to occur over a number of consecutive words in ASR output. This phenomenon is not typically accounted for in confidence estimation, but is exploited here through the sequential nature of the CRF. A custom CRF framework is developed, making it possible for useful feature functions to be engineered. This framework is extended to support hidden-state CRFs. To inform this confidence estimation model, novel predictor features indicative of the quality of ASR hypotheses are proposed, along with a technique for their extraction from lattices. The CRF-based approach is used to combine multiple predictor features and estimate confidence scores for words in ASR hypotheses. This yields performance improvements in the normalised cross entropy (NCE) metric of up to 11.4% relative to a strong baseline (using decision trees). The novel application of a hidden-state CRF to this task yields further relative improvements of up to 17.2%. Estimating confidence scores on the sub-word-level is also investigated. Sub-word-level features are combined with word-level features to yield improvements of up to 31.7% relative. The use of a hidden-state CRF for this task yields even larger relative gains of up to 48.6%. The application of CRFs to estimate keyterm confidence scores for spoken term detection is proposed. Discriminative features for keyterm hypotheses are introduced, as well as a model-based approach to keyterm score normalisation. This approach results in improvements of 26% and 36% relative in the miss rate and false alarm rate at operating points of interest.
The novel task of detecting deletions within ASR output is investigated. The sequential nature of the CRF is exploited to make this possible, such that regions in which deletions occur are modelled. Modelling word confidence and deletion regions simultaneously yields an approach which is capable of detecting deletions. Overall, the proposed framework for confidence estimation is shown to yield improved confidence estimates. This is important for downstream applications (e.g. dialogue systems, keyterm detection) which make decisions based on these scores, as well as in-system applications (e.g. data selection and adaptation).
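The abstract reports its gains in normalised cross entropy (NCE) without defining the metric. As a rough illustration, the sketch below computes NCE in the form commonly used for ASR confidence evaluation: the entropy of a baseline that assigns every word the average correctness rate, minus the cross entropy of the confidence scores, divided by the baseline entropy. It is an assumption that the thesis uses exactly this form; the function name, the `eps` clipping parameter, and the toy data are hypothetical and not taken from the source.

```python
import math


def normalised_cross_entropy(confidences, correct, eps=1e-12):
    """Sketch of NCE for word confidence scores (assumed standard form).

    confidences: confidence scores in [0, 1], one per hypothesised word.
    correct:     booleans (or 0/1), True where the word is correct.
    eps:         hypothetical clipping constant to avoid log(0).
    """
    n = len(confidences)
    n_correct = sum(1 for ok in correct if ok)
    p_c = n_correct / n  # prior probability that a word is correct
    if n_correct == 0 or n_correct == n:
        raise ValueError("NCE is undefined when all words share one label")

    # Entropy of the baseline that gives every word the confidence p_c.
    h_max = -(n_correct * math.log2(p_c)
              + (n - n_correct) * math.log2(1.0 - p_c))

    # Cross entropy of the actual confidence scores against the labels.
    h_conf = 0.0
    for c, ok in zip(confidences, correct):
        c = min(max(c, eps), 1.0 - eps)
        h_conf -= math.log2(c if ok else 1.0 - c)

    return (h_max - h_conf) / h_max


# Toy data (invented for illustration): three correct words, one error.
labels = [True, True, True, False]
nce_prior = normalised_cross_entropy([0.75, 0.75, 0.75, 0.75], labels)
nce_good = normalised_cross_entropy([0.9, 0.8, 0.95, 0.2], labels)
```

Under this definition a score of 0 means the confidences are no more informative than the prior correctness rate, values approaching 1 indicate near-perfect confidences, and negative values indicate confidences worse than the prior.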

Related resources

Driving ROVER with Segment-based ASR Quality Estimation

ROVER is a widely used method to combine the output of multiple automatic speech recognition (ASR) systems. Though effective, the basic approach and its variants suffer from potential drawbacks: i) their results depend on the order in which the hypotheses are used to feed the combination process, ii) when applied to combine long hypotheses, they disregard possible differences in transcription q...

Combining Information Sources for Confidence Estimation with CRF Models

Obtaining accurate confidence measures for automatic speech recognition (ASR) transcriptions is an important task which stands to benefit from the use of multiple information sources. This paper investigates the application of conditional random field (CRF) models as a principled technique for combining multiple features from such sources. A novel method for combining suitably defined features ...

A Database for Automatic Persian Speech Emotion Recognition: Collection, Processing and Evaluation

Recent developments in robotics automation have motivated researchers to improve the efficiency of interactive systems by enabling natural man-machine interaction. Since speech is the most popular method of communication, recognizing human emotions from the speech signal has become a challenging research topic known as Speech Emotion Recognition (SER). In this study, we propose a Persian em...

GiniSupport Vector Machines for Segmental Minimum Bayes Risk Decoding of Continuous Speech

We describe the use of Support Vector Machines (SVMs) for continuous speech recognition by incorporating them in Segmental Minimum Bayes Risk decoding. Lattice cutting is used to convert the Automatic Speech Recognition search space into sequences of smaller recognition problems. SVMs are then trained as discriminative models over each of these problems and used in a rescoring framework. We pos...

Automatic quality estimation for ASR system combination

Recognizer Output Voting Error Reduction (ROVER) has been widely used for system combination in automatic speech recognition (ASR). In order to select the most appropriate words to insert at each position in the output transcriptions, some ROVER extensions rely on critical information such as confidence scores and other ASR decoder features. This information, which is not always available, high...

Word Confidence Estimation for Speech Translation

Word Confidence Estimation (WCE) for machine translation (MT) or automatic speech recognition (ASR) consists in judging each word in the (MT or ASR) hypothesis as correct or incorrect by tagging it with an appropriate label. In the past, this task has been treated separately in ASR or MT contexts and we propose here a joint estimation of word confidence for a spoken language translation (SLT) t...

Publication date: 2013